Enforce DataFrame display memory limits with max_rows + min_rows constraint (deprecate repr_rows) #1367

kosiew wants to merge 15 commits into apache:main

Conversation
Title changed from "… max_rows and enforcing min_rows_display <= max_rows" to "… max_rows + min_rows constraint (deprecate repr_rows)"
timsaucer left a comment:
Thank you for taking this on. I think the new solution is very nice!
min_rows_display=20,  # Minimum number of rows to display
repr_rows=10,  # Number of rows to display in __repr__
It looks like the default here has min_rows > max_rows. Also should we have consistent naming of the two? Either min_rows and max_rows or min_rows_display and max_rows_display?
I think the _display was differentiating what happens during a display() call vs __repr__ but I think these values get used during both calls.
I'll change to use min_rows, max_rows.
min_rows_display=50,  # Always show at least 50 rows
repr_rows=20  # Show 20 rows in __repr__ output
Same as above: the difference between the _display suffix and no suffix, and again we have min_rows > max_rows here.
    Minimum number of rows to display.
repr_rows : int, default 10
    Default number of rows to display in repr output.
min_rows_display : int, default 10
It's not about this PR per se, but maybe this is an opportunity to tighten up the comments here. We're repeating ourselves with the types and defaults. Those are already in the type hints. I think it's becoming customary to not duplicate that information and the argument line is the preferred place to keep it. That way we don't have to worry about maintaining the values in two places.
I'll refactor the docstrings to follow NumPy/pandas style without duplicating types and defaults.
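For example, the Parameters section could drop the repeated types/defaults since they already live in the signature (a sketch with a simplified signature and illustrative defaults, not the exact code in this PR):

```python
def configure_formatter(*, max_rows: int = 20, min_rows_display: int = 10) -> None:
    """Configure the global DataFrame HTML formatter.

    Parameters
    ----------
    max_rows
        Maximum number of rows rendered in ``__repr__``/HTML output.
    min_rows_display
        Number of rows always rendered even when the memory budget is
        exceeded; must not exceed ``max_rows``.
    """
```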
| """ | ||
| self._max_rows = value | ||
|
|
||
| @property |
If repr_rows is being deprecated, why add an accessor?
I added the accessors for backward compatibility during the deprecation period.

Rationale:
- User code may access the property directly: code like `formatter.repr_rows = 20` continues working during the deprecation period.
- Graceful migration path: users get a warning, but their code doesn't break.
- Custom formatter implementations: external code that inherits from the formatter and accesses `repr_rows` directly will continue to work.

Shall we keep the accessors for now with the deprecation warnings, and plan removal in the next major version?
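For illustration, a deprecated alias property could look roughly like this (a sketch, not the exact code in this PR; the constructor is simplified):

```python
import warnings


class DataFrameHtmlFormatter:
    """Sketch: max_rows is the real storage, repr_rows is a warning alias."""

    def __init__(self, max_rows: int = 20) -> None:
        self._max_rows = max_rows

    @property
    def max_rows(self) -> int:
        return self._max_rows

    @max_rows.setter
    def max_rows(self, value: int) -> None:
        self._max_rows = value

    @property
    def repr_rows(self) -> int:
        warnings.warn(
            "repr_rows is deprecated; use max_rows instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return self._max_rows

    @repr_rows.setter
    def repr_rows(self, value: int) -> None:
        warnings.warn(
            "repr_rows is deprecated; use max_rows instead",
            DeprecationWarning,
            stacklevel=2,
        )
        self._max_rows = value
```

This keeps `formatter.repr_rows = 20` working while steering users toward `max_rows`.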
    @repr_rows.setter
    def repr_rows(self, value: int) -> None:
        """Set the maximum number of rows using deprecated name.
python/tests/test_dataframe.py
def test_html_formatter_repr_rows(df, clean_formatter_state):
    configure_formatter(min_rows_display=2, repr_rows=2)

def test_html_formatter_memory_boundary_conditions(df, clean_formatter_state):
Maybe switch to large_df instead of df here?
Great suggestion! The large_df fixture (100,000 rows) is much better suited for testing memory boundary conditions than the standard `df` fixture.
python/tests/test_dataframe.py
# Get the raw size of the data to test boundary conditions
# First, capture output with no limits
configure_formatter(max_memory_bytes=10 * MB, min_rows_display=1, max_rows=100)
If you do switch to `large_df`, then I think it may go above the 100-row limit you have here.
I'll adjust this higher.
python/tests/test_dataframe.py
unrestricted_rows = count_table_rows(unrestricted_output)

# Test 1: Very small memory limit should still respect min_rows
configure_formatter(max_memory_bytes=10, min_rows_display=1)
I think a better test is one where we do hit the memory limit well before we hit the min number of rows, hence the recommendation to switch to large_df. Actually, maybe we want a different dataframe, one where we know we have multiple batches instead of a single batch. The thing this isn't doing is verifying we've ended the stream early, but I think that would have to be a Rust test instead of a pytest.
I'll create tests for early stream termination behavior with multi-batch DataFrames.
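A sketch of what such a test might look like (the `clean_formatter_state` fixture is the one in the existing test file; sizes, thresholds, and the `<tr>`-counting shortcut are only illustrative and would need tuning):

```python
import pyarrow as pa
from datafusion import SessionContext
from datafusion.html_formatter import configure_formatter


def test_memory_limit_with_multiple_batches(clean_formatter_state):
    # Build a DataFrame backed by several record batches so the display code
    # has the opportunity to stop pulling batches once the budget is hit.
    ctx = SessionContext()
    batches = [
        pa.RecordBatch.from_arrays(
            [pa.array(["x" * 1_000] * 1_000)], names=["payload"]
        )
        for _ in range(10)
    ]
    df = ctx.create_dataframe([batches])  # one partition, ten batches

    # Tiny memory budget, small guaranteed minimum, generous row cap: the
    # memory limit should cut the display off long before 10,000 rows.
    configure_formatter(max_memory_bytes=64 * 1024, min_rows_display=2, max_rows=1_000)
    html = df._repr_html_()

    # Crude proxy for rendered row count; as noted above, verifying that the
    # stream itself ended early would need a Rust-side test.
    assert html.count("<tr>") < 10_000
```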
src/dataframe.rs
let max_bytes = get_attr(formatter, "max_memory_bytes", default_config.max_bytes);
let min_rows = get_attr(formatter, "min_rows_display", default_config.min_rows);
let repr_rows = get_attr(formatter, "repr_rows", default_config.repr_rows);
let max_rows = get_attr(formatter, "max_rows", default_config.max_rows);
Since users may have provided their own implementation of the formatter, I think we need to first try getting `max_rows`. If that fails, try getting `repr_rows`. If that fails, take the default. When we remove `repr_rows` entirely after it's been deprecated for a few releases, we can revert to this simpler logic.
I'll add the fallback for repr_rows.
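In Python terms, the lookup order being suggested is roughly the following (a sketch of the intent only; the actual change lives in the Rust `get_attr` calls shown above):

```python
def resolve_max_rows(formatter, default_max_rows: int) -> int:
    # Prefer the new attribute, fall back to the deprecated alias,
    # and only then use the built-in default.
    max_rows = getattr(formatter, "max_rows", None)
    if max_rows is not None:
        return max_rows
    repr_rows = getattr(formatter, "repr_rows", None)  # deprecated name
    if repr_rows is not None:
        return repr_rows
    return default_max_rows
```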
Removed type annotations and redundant default values from parameter names. Enhanced descriptions for clarity and added context for usage. Fixed formatting for the documentation sections to improve readability.
Which issue does this PR close?
Rationale for this change
Large DataFrames could ignore the configured `max_memory_bytes` limit during display. Previously the defaults (`repr_rows=10`, `min_rows_display=20`) meant the collection loop condition `rows_so_far < min_rows` stayed true even after exceeding the memory budget, causing significantly more data to be streamed/collected than intended.

This PR resolves that by:
- Introducing an explicit `max_rows` setting (replacing `repr_rows`).
- Validating that the minimum rows to display (`min_rows_display`) cannot exceed the maximum rows cap.
- Keeping a deprecated `repr_rows` alias so existing users aren't broken immediately.

What changes are included in this PR?
- Docs: Update user guide examples to use `max_rows` instead of `repr_rows`.
- Python formatter API:
  - Add `max_rows` as the primary configuration for limiting displayed rows.
  - Keep `repr_rows` as a deprecated alias (constructor arg + property), emitting `DeprecationWarning`.
  - Add centralized validation via `_validate_formatter_parameters()`:
    - Ensure `min_rows_display <= max_rows`.
    - Error when both `repr_rows` and `max_rows` are provided with different values.
  - Store the resolved value internally as `_max_rows` and expose `max_rows` / deprecated `repr_rows` properties.
  - Add `max_rows` to the `configure_formatter()` allowed keys.
- Rust display/streaming logic:
  - Rename `repr_rows` -> `max_rows`.
  - Lower the default `min_rows` (20 → 10) to avoid violating the min/max relationship.
  - Validate `min_rows <= max_rows`.
  - The collection loop runs while within the limits (`memory && max_rows`) or until the guaranteed `min_rows` is reached, with clearer comments.

Are these changes tested?
Yes.
- Updated existing formatter tests to use `max_rows`.
- Added new tests for:
  - Memory-limit boundary conditions (tiny budget, default budget, large budget, and min-rows override).
  - `repr_rows` backward compatibility: a `DeprecationWarning` is emitted when it is used, its value is applied to `max_rows`, and conflicts with an explicitly supplied `max_rows` are rejected.
  - Validation failures for invalid `max_rows` and for `min_rows_display > max_rows`.

Are there any user-facing changes?
Yes.
- New option: `max_rows` is now the preferred way to cap rows displayed in repr/HTML output.
- Deprecation: `repr_rows` is deprecated and will emit a `DeprecationWarning`; existing code that sets `repr_rows` continues to work. Supplying both `repr_rows` and `max_rows` with different values raises a `ValueError` (see the usage sketch below).
- Behavioral change: the default minimum rows displayed changes from 20 to 10.
- Docs: updated examples and clarified that `min_rows_display` must be `<= max_rows`.

If the deprecation/rename is considered a public API change, please add the `api change` label.
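A rough sketch of the intended user-facing behavior (illustrative only; exact messages and defaults may differ):

```python
import warnings

from datafusion.html_formatter import configure_formatter

# Preferred: cap displayed rows with the new option.
configure_formatter(max_rows=50, min_rows_display=10)

# The deprecated alias still works, but warns.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    configure_formatter(repr_rows=30)
assert any(issubclass(w.category, DeprecationWarning) for w in caught)

# Conflicting values are rejected.
try:
    configure_formatter(repr_rows=10, max_rows=20)
except ValueError:
    pass

# So is a minimum that exceeds the maximum.
try:
    configure_formatter(min_rows_display=50, max_rows=20)
except ValueError:
    pass
```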
LLM-generated code disclosure

This PR includes code and comments generated with assistance from an LLM. All LLM-generated content has been manually reviewed and tested.